CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This continuation application claims the benefit of U.S. Patent Application
No. 09/347,947 for Video Bit Stream Extension with Supplemental Content Information
to Aid in Subsequent Video Processing, filed July 6, 1999, Assignee Intel Corporation.

BACKGROUND 

Field 

[0002] The present invention is related to the field of video processing, and more
specifically, to a method and apparatus for video bit stream extension to aid in
subsequent video processing.

Related Art 

[0003] Currently, the motion picture and television industries utilize various
video processing methods and apparatus to add special effects to captured video.
Typically, video images are captured by some type of a video capture device, and 
recorded on a storage device. One or more video sequences from one or more video 
sources, such as a storage device, may be analyzed to collect information regarding each 
video sequence. The collected information is then utilized to manipulate the video 
sequence through, for example, some type of computer system, to create special effects. 
For example, video sequences from two different sources may be merged by video 
compositing to create special effects. In order to perform video compositing, information 
regarding each video sequence may be useful. 

[0004] It is therefore desirable to provide a method and apparatus to generate an

extended video bit stream for a video sequence. The desired extended video bit stream 
would contain non-visual information useful in several applications including but not 
limited to motion picture and television special effects, post processing, video editing and 
video library management. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Figure 1a is an exemplary system block diagram of a video source and
video output system according to one embodiment of the present invention.

[0006] Figure 1b illustrates a video processing information collection and
annotation system according to one embodiment of the present invention.

[0007] Figure 2 is a block diagram illustrating calibration software configured to

determine camera geometry information from each frame and/or scene data of a video 
sequence made available through a video source device according to one embodiment of 
the present invention. 

[0008] Figure 3 is a block diagram of pose estimation software configured to 

determine camera pose information for each frame and/or scene data collected from a 
video source device according to one embodiment of the present invention. 

[0009] Figure 4 is a block diagram of illumination input software which is

configured to accept illumination information for a given frame or scene data of a video 
sequence provided through a video source device according to one embodiment of the 
present invention. 

[0010] Figures 5a-5c illustrate an annotated video extended bit stream format 

according to one embodiment of the present invention. 

[0011] Figure 6 is a flow diagram illustrating the general steps followed by 

annotator software configured to annotate a video bit stream with certain video 
processing information according to one embodiment of the present invention. 

DETAILED DESCRIPTION

[0012] The present invention relates to a method and apparatus for producing an 

annotated video extended bit stream to aid video processing. In one embodiment of the 
present invention, certain video sequence information is collected on a per frame, per 
video segment or a per scene basis, as video data is being provided through a video 
source device. Video information is added to a video bit stream to generate an extended 
video bit stream with annotations (hereinafter referred to as an annotated video extended 
bit stream). 

[0013] The annotated video extended bit stream of the present invention includes 

both the original video sequence as well as certain additional non-visual information. 
The additional non-visual information is useful in subsequent processing of the video 
data provided by a video source device for several applications including special effect 
processing, easy access to video databases, efficient video coding, and easier generation 
of three-dimensional models. 

[0014] In one embodiment video and audio media are annotated with 

supplementary content for post video processing. Video data is accepted from a video 
source and video information associated with the video data is stored as the video data is 
being accepted. Then, the video information may be appended to the video data for later 
use in the form of annotations, for example. 

[0015] Prior art methods must rediscover the information which the present 

invention annotates by using only the available visual data. Further, the present 
invention solves mathematically ill posed problems for which solutions are not available. 
Additionally, the present invention efficiently provides access to data for which solutions 
are possible, but are obtained through great computational cost. The present invention 
leads to better and faster compression via model building than prior art methods which
rediscover the factors from the captured media only. 

[0016] Additionally, the present invention allows better digital manipulation of 

media. For example, three-dimensional models of objects in a scene can be constructed 
and used as search keys in a video database, and geometric and photometric properties 
can be used as derived search keys. Camera scene content is often already under
program control. This control information, as well as what can be sensed in the camera,
such as a cameraman turning a zoom lens or a change in camera pose, is conveyed to
subsequent processing software through the annotated video extended bit stream of the
present invention.

[0017] Although the present invention is described for use in the motion picture 

and television industries, the present invention may also be useful in video cameras used 
by consumers to automatically or manually annotate certain parameters set during 
recording, such as zoom factor or focus. 

[0018] Figure 1a is an exemplary system block diagram illustrating a video

source and video output system according to one embodiment of the present invention. 
In this example, the video processing information collection and annotation system 104 
may be implemented for use on a computer 102 with some type of a storage device 103 
coupled to a processor 105 by bus 107. The computer 102 may be coupled to a video 
source 100 configured to provide video data (video sequence) to the computer 102. In 
one embodiment, the video source 100 may be, but is not limited to, a video capture
device 106 such as a camera, a video replay device 108 such as a video cassette recorder
(VCR), or a video transmission device 110 such as a digital or an analog television
(TV).

[0019] The present invention's video processing information collection and

annotation system 104 collects certain video processing information from video image 
data (frame or scene) as they are provided from the video source 100 and generates an 
annotated video extended bit stream comprised of the video image data as well as 
corresponding video processing information. 

[0020] Although the video processing information collection and annotation 

system 104 is illustrated as being implemented in a computer 102, a person skilled in the 
art provided with the detailed description herein may appreciate that the video processing 
information collection and annotation system 104 may also be implemented in the video 
source 100 such as a camera. 

[0021] Figure 1b illustrates a video processing information collection and

annotation system 104 according to one embodiment of the present invention. Video 
data 200 is provided by video source 100 in the form of frames (scenes) and forwarded to 
a collector 201. Calibration software 206, such as NCAMERA from REALVIZ of
Sophia Antipolis Cedex, France, generates camera geometry information 204. The
camera geometry information 204 may be generated either from the video data 200
provided by the video source 100 or as the video data is being captured by a video
capture device 106. The camera geometry information 204 is then forwarded to the
collector 201. Similarly, pose estimation software 210, such as
MATCHMOVER from REALVIZ of Sophia Antipolis Cedex, France, generates camera 
pose information 208. The camera pose information 208 may be generated from the 
video data 200 provided by the video source 100 or at the same time that the video data 
200 is being captured by a video capture device 106. The camera pose information 208 is 
then forwarded to the collector 201. 

[0022] Source identification/description and illumination information is gathered

by a source identification/description/illumination input software 212 from, for example, 
input by a camera operator (user) or a television director operating a television switcher 
(not shown). The information gathering may be fully automated by the television 
switcher. The gathered source identification/description/illumination information 211 is
then forwarded along with the camera geometry information 204 and the camera pose 
information 208 to the collector 201. The collector 201 forwards the information 
collected to an annotator 202. 

[0023] The current state of the frame data and the gathered information are 

received and compared by a comparator 222. The gathered information and current state 
are also received by a differential generator 230 which determines the differential 
information. 

[0024] More specifically, the auxiliary information is gathered periodically but 

only encoded in the bit stream if there are changes. The current state indicates the latest 
encoded information in the bit stream and is kept in memory, e.g., state storage 240. The 
differential information is the difference between the current information and the 
gathered information. If the information remains the same, there is no need to annotate 
the bit stream. 
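
The comparison and differential generation described above can be sketched as follows. This is a minimal illustration only: it assumes the gathered information and the current state are represented as key/value dictionaries, and the names `differential`, `zoom`, and `source_id` are hypothetical and do not appear in the specification.

```python
def differential(current_state, gathered):
    """Return only the gathered entries that differ from the current state.

    Both arguments are assumed (for illustration) to be plain dictionaries
    mapping a parameter name to its latest value.
    """
    missing = object()  # sentinel so a parameter never seen before counts as changed
    return {
        name: value
        for name, value in gathered.items()
        if current_state.get(name, missing) != value
    }


# State storage holds the latest information already encoded in the bit stream.
state = {"zoom": 2.0, "source_id": 3}

# Newly gathered information: only the source number has changed.
gathered = {"zoom": 2.0, "source_id": 5}

diff = differential(state, gathered)   # only the changed parameter remains
state.update(diff)                     # the state now reflects what was encoded

# Gathering the same information again yields an empty differential,
# so nothing needs to be annotated for that frame.
unchanged = differential(state, gathered)
```

An empty differential corresponds to the case where the information remains the same and the bit stream is not annotated.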

[0025] The differential information is encoded by an encoder 224 as auxiliary 

information for the current frame data. The annotator 202 stores the various information 
gathered in the video bit stream 200 by extending the original bit stream format. The 
annotated video extended bit stream 214 may be utilized by a video processor 216 to 
separate video data 218 from auxiliary information 220. The video data 218 is equivalent 
to video data 200 before processing by the annotator 202 and the auxiliary information 
220 is additional information gathered regarding the video data 200 including the camera 
geometry information 204, the camera pose information 208 and the source
identification/description/illumination information 211.

[0026] Figure 2 is a block diagram illustrating the gathering of calibration 

information according to one embodiment of the present invention. In this example, 
camera geometry (intrinsic parameters) is provided by a camera calibration procedure 
which provides a camera projection matrix 270. In one embodiment, instrumentation 
attached to a set of actuators 250 is used to control and keep track of the geometry of 
cameras (e.g., auto-focus). Calibration is obtained by using a calibration box 260 of 
known size and shape with a distinct set of markers on its surface. The calibration 
software 206 utilizes the calibration box 260 to produce the camera projection matrix 
270. 

[0027] More specifically, in one embodiment, a video capture device 106 such as 

a camera causes a transformation to the incident light. This transformation depends on
the physical arrangement of lenses and light sensors of the video capture device and is
modeled by a 3 x 4 transformation matrix, also referred to herein as a camera projection
matrix 270. The twelve parameters of the projection matrix define the camera
geometry. Various methods for producing the projection matrix 270 are well known.
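
As an illustration of how a 3 x 4 projection matrix maps a scene point to an image point, the following sketch applies the matrix to a point in homogeneous coordinates and divides by the third component. The matrix values (a focal length of 800 pixels and a principal point at (320, 240), with the camera at the origin) and the function name `project` are assumptions chosen for the example, not values from the specification.

```python
def project(P, point):
    """Project a 3D point (X, Y, Z) to image coordinates using a 3x4 matrix P."""
    X = (*point, 1.0)  # homogeneous coordinates (X, Y, Z, 1)
    u, v, w = (sum(p * x for p, x in zip(row, X)) for row in P)
    if w == 0:
        raise ValueError("point has no finite projection")
    return (u / w, v / w)


# Illustrative projection matrix: twelve parameters in three rows of four.
P = [
    [800.0, 0.0, 320.0, 0.0],
    [0.0, 800.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

pixel = project(P, (0.5, -0.25, 2.0))
print(pixel)  # (520.0, 140.0)
```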

[0028] Figure 3 is a block diagram illustrating the gathering of camera pose 

estimation information according to one embodiment of the present invention. The 
camera pose information 208 includes extrinsic camera parameters. More specifically, 
the images generated by a video capture device such as a camera depend on the position
of the video capture device 106. The six degrees of freedom of camera pose may be
defined with respect to a given reference. For example, the position of the camera on the 
very first frame may be used as a reference from which the camera pose for the 
subsequent frames may be measured. 
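
One way to express a camera pose relative to a reference frame is sketched below. The representation is an assumption made for illustration: each pose is a pair of a 3 x 3 rotation matrix (camera orientation in world coordinates) and a position vector, which together carry the six degrees of freedom. The helper names are hypothetical.

```python
import math


def rot_z(degrees):
    """Rotation matrix for a rotation about the z axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relative_pose(ref, pose):
    """Express `pose` in the coordinate frame of the reference pose `ref`."""
    R_ref, t_ref = ref
    R, t = pose
    R_ref_T = transpose(R_ref)  # the inverse of a rotation is its transpose
    R_rel = matmul(R_ref_T, R)
    t_rel = matvec(R_ref_T, [a - b for a, b in zip(t, t_ref)])
    return R_rel, t_rel


# Pose on the first frame (the reference) and on a later frame.
ref = (rot_z(90.0), [1.0, 2.0, 0.0])
pose = (rot_z(120.0), [1.0, 3.0, 0.0])

R_rel, t_rel = relative_pose(ref, pose)
```

In this example the later frame is rotated a further 30 degrees about the z axis and displaced one unit along the reference camera's x axis, up to floating-point rounding.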

[0029] Computation of the pose estimation information 208 may be made by the

pose estimation software 210. Depth maps are estimated from a video sequence by using 
structure from motion. For example, the registration of the depth map to a three- 
dimensional model of the scene provides the pose information. The three-dimensional 
scene model may be constructed using the registered data in one embodiment. Obtaining 
pose estimation from video data is well known in the art and needs no further discussion. 

[0030] Alternatively, camera pose may be provided by a position tracking system 

based on gyroscopes, pendulums or magnetic field sensors (similar to those used in 
virtual reality helmets) installed in a video capture device 106 and provided to the 
collector 201 by the video capture device 106. Instrumentation attached to a set of
actuators 250 (attached to the video capture device 106) may be used to remotely control
the pose of robotic cameras.

[0031] Figure 4 is a block diagram illustrating the gathering of source 

identification/description and illumination information according to one embodiment of 
the present invention. Video source identification and description is provided by using 
information which is available either at a television (TV) switcher (live production) or 
editing A/B roll machine (off-line production). A TV director/editor defines an editing 
list which is implemented by an operator controlling the TV switcher 400 or A/B roll 
machine. 

[0032] All cameras and video sources are logged before actual editing. For future 

movies, this process is performed at the story board. Story board drawings and text have
a one-to-one correspondence to video shoots. Individual shoot descriptions may be
annotated onto the final video at production time.

[0033] Video source identification 402, which is part of the source
identification/description/illumination information 211, delineates one parameter which
defines a source number. More specifically, production video typically is composed
from multiple video sources 106 where each clip may be identified at production time.
Each clip has an associated parameter which defines a source number for the clip.

[0034] The video source description 405 of the source
identification/description/illumination information 211 refers to textual descriptions for
distinct video sources. For instance, TV directors typically have textual descriptions for
distinct video sources, such as for a National Basketball Association (NBA) game.
"Scoreboard shoot", "court overview", "action on court", "close-up on coach", or
"close-up of commentator" are different source descriptions.

[0035] Finally, the illumination condition 403, also part of the source
identification/description/illumination information 211, refers to computer controlled sets
and lighting system 404. The controls may be annotated as part of the scene description
as well. For insertion of computer-generated actors in a scene, lighting information is
fundamental on a per-frame basis.

[0036] Figures 5a - 5c illustrate an exemplary annotated video extended bit
stream format according to one embodiment of the present invention. The exemplary
annotated video extended bit stream 500 illustrated in Figure 5a has ancillary
information 502 which is a list of parameters of arbitrary length. The parameters are
described in more detail in a corresponding entry in a tag lookup table 503 illustrated in
Figure 5c. In one embodiment, the tag lookup table 503 resides in storage device 103.
The parameters having the information described in Figures 2, 3 and 4 are valid until a
new setting appears in the annotated video extended bit stream 500.

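
The rule that a parameter remains valid until a new setting appears in the annotated video extended bit stream can be sketched as follows. The sketch assumes, for illustration only, that the ancillary information of each frame is a list of (tag, value) pairs and that a tag lookup table maps each tag to a parameter name. The tag numbers, parameter names, and the function `resolve_parameters` are hypothetical; the actual layout is the one shown in Figures 5a - 5c.

```python
# Hypothetical tag lookup table: tag number -> parameter name.
TAG_TABLE = {
    1: "source_id",
    2: "source_description",
    3: "camera_pose",
    4: "illumination",
}


def resolve_parameters(stream):
    """Return, for each frame, the full set of parameters in effect.

    `stream` is a list of (frame, ancillary) pairs, where `ancillary` is a
    list of (tag, value) pairs of arbitrary length (possibly empty).
    """
    in_effect = {}
    resolved = []
    for frame, ancillary in stream:
        for tag, value in ancillary:
            in_effect[TAG_TABLE[tag]] = value  # a new setting replaces the old one
        resolved.append((frame, dict(in_effect)))
    return resolved


stream = [
    ("frame0", [(1, 3), (4, "studio lights on")]),
    ("frame1", []),            # no annotation: previous settings still apply
    ("frame2", [(1, 5)]),      # only the source number changes
]

resolved = resolve_parameters(stream)
```

Here `frame1` carries no ancillary information yet resolves to the same parameters as `frame0`, and `frame2` keeps the earlier illumination setting while taking the new source number.
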
[0037] Figure 6 is a flow diagram illustrating the general steps followed by

annotator software configured to annotate a video bit stream with certain video 
processing information according to one embodiment of the present invention. In step 
601, frame data is obtained from a video source. In step 602, information regarding the
frame data, including camera geometry information 204, camera pose information 208 and
source identification/description/illumination information 211, is gathered. In step 603, the
gathered information is stored in a state storage device. In step 604, the current state of the
frame data and the gathered information are compared by a comparator 222.

[0038] In step 605, differential information is generated according to the results 

of the comparison. More specifically, the auxiliary information is gathered periodically 
but only encoded in the bit stream if there are changes. The current state indicates the 
latest encoded information in the bit stream and is kept in memory. The differential 
information is the difference between the current information and the gathered 
information. If the information remains the same, there is no need to annotate the bit 
stream. 

[0039] In step 606, the differential information is encoded by an encoder 224 as 

auxiliary information for the current frame data. In step 607, the auxiliary information is 
annotated to the encoded frame data to generate an annotated video extended bit stream 
214. 
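
The general flow of steps 601 through 607, together with the later separation of video data from auxiliary information, can be sketched as below. This is an illustrative sketch under stated assumptions, not the claimed implementation: frames are stand-in strings, auxiliary information is encoded as JSON text, and the names `annotate`, `separate`, and `settings` are hypothetical. For simplicity the sketch updates the stored state after the comparison rather than before it.

```python
import json


def annotate(frames, gather):
    """Build an annotated stream, encoding auxiliary information only on change."""
    state = {}    # latest information already encoded in the stream
    stream = []
    for frame in frames:                          # step 601: obtain frame data
        gathered = gather(frame)                  # step 602: gather information
        diff = {                                  # steps 604-605: compare, take differential
            name: value
            for name, value in gathered.items()
            if name not in state or state[name] != value
        }
        state.update(diff)                        # keep the state storage current
        aux = json.dumps(diff, sort_keys=True) if diff else None  # step 606: encode
        stream.append({"frame": frame, "aux": aux})               # step 607: annotate
    return stream


def separate(stream):
    """Split an annotated stream back into video data and auxiliary information."""
    video = [entry["frame"] for entry in stream]
    aux = [json.loads(entry["aux"]) if entry["aux"] else {} for entry in stream]
    return video, aux


frames = ["frame0", "frame1", "frame2"]
settings = {
    "frame0": {"zoom": 1.0},
    "frame1": {"zoom": 1.0},   # unchanged, so nothing is annotated
    "frame2": {"zoom": 2.5},
}

stream = annotate(frames, settings.__getitem__)
video, aux = separate(stream)
```

The recovered `video` list equals the original frames, and only the first and third frames carry auxiliary information.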

[0040] What has been described is a method and apparatus for annotating video 

and audio media with supplementary content for subsequent video processing. Prior art 
methods must rediscover the information which the present invention annotates by using 
only the available visual data. The present invention solves mathematically ill posed 
problems for which solutions are not available. Additionally, the present invention 
efficiently provides access to data for which solutions are possible, but are obtained 
through great computational cost. The present invention leads to better and faster
compression via model building than prior art methods which rediscover the factors from 
the captured media only. 

[0041] While certain exemplary embodiments have been described in detail and 

shown in the accompanying drawings, it is to be understood that such embodiments are 
merely illustrative of and not restrictive on the broad invention, and that this invention is 
not to be limited to the specific arrangements and constructions shown and described, 
since various other modifications may occur to those with ordinary skill in the art. 